add FanOutMapper for one-to-many partition fan-out #66030

Draft
Lee-W wants to merge 17 commits into apache:main from astronomer:partition-fanout

Lee-W (Member) commented Apr 28, 2026

Why

Roll-up partitioning currently handles N→1 (many upstream keys feeding one downstream run), but the symmetric 1→N case — one weekly key fanning out into seven daily Dag runs — has no first-class mapper. Authors had to hand-roll fan-out logic per Dag, which made the unbounded-keys footgun easy to hit.

closes: #65654

What

  • Add FanOutMapper composing upstream_mapper + window + downstream_mapper, mirroring RollupMapper's shape.
  • Resolve the default downstream mapper from the Window class name so SDK and core Window types both map to the same default (WeekWindow → StartOfDayMapper, etc.).
  • Add [scheduler] partition_fanout_max_keys (default 1000) capping downstream keys per upstream event; over-limit fan-outs are skipped and logged against the source task instance.
  • Wire FanOutMapper through assets/manager.py and serialization/encoders.py.
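
The composition and the key cap described above can be pictured with a minimal sketch. This is not the PR's implementation: `FanOutSketch`, `WeekWindow.expand`, `start_of_day_key`, and the lookup table are hypothetical names invented for illustration, and the weekly key is assumed to be the ISO date of the week's first day. Only the default of 1000 and the skip-when-over-limit behavior come from the description.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta


class WeekWindow:
    """Hypothetical window: expands a week-start date into its seven days."""

    def expand(self, start: date) -> list[date]:
        return [start + timedelta(days=i) for i in range(7)]


def start_of_day_key(day: date) -> str:
    """Hypothetical stand-in for a StartOfDayMapper-style encoder."""
    return day.strftime("%Y-%m-%d")


# Default downstream encoder looked up by Window class *name*, so two
# distinct classes sharing a name (for example an SDK type and a core type)
# resolve to the same default.
DEFAULT_DOWNSTREAM_BY_WINDOW_NAME = {"WeekWindow": start_of_day_key}


@dataclass
class FanOutSketch:
    window: object
    max_keys: int = 1000  # mirrors the described default of partition_fanout_max_keys

    def to_downstream(self, upstream_key: str) -> list[str]:
        # Assumption: the upstream (weekly) key is the week's start date.
        start = datetime.strptime(upstream_key, "%Y-%m-%d").date()
        encode = DEFAULT_DOWNSTREAM_BY_WINDOW_NAME[type(self.window).__name__]
        keys = [encode(day) for day in self.window.expand(start)]
        if len(keys) > self.max_keys:
            # Over-limit fan-outs are skipped entirely, not truncated.
            return []
        return keys
```

With these assumptions, `FanOutSketch(WeekWindow()).to_downstream("2026-04-27")` yields the seven daily keys from `2026-04-27` through `2026-05-03`, while a cap below seven yields no keys at all.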

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Generated-by: [Claude] following the guidelines


  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

boring-cyborg (Bot) added the labels area:API, area:ConfigTemplates, area:DAG-processing, area:Scheduler, area:task-sdk, area:UI on Apr 28, 2026
Lee-W (Member, Author) commented Apr 28, 2026

Only the last commit matters.

Lee-W force-pushed the partition-fanout branch 17 times, most recently from e4d0931 to dd3408e, May 8, 2026 08:43
Lee-W added 11 commits May 8, 2026 20:30
…t ordering

StartOfWeekMapper and StartOfQuarterMapper now derive their decode_downstream
regex from output_format itself, so users can re-order strftime directives
and {name} placeholders (e.g. "Q{quarter}/%Y") without having to override
decode_downstream. Malformed output_format — empty {}, non-identifier
placeholder names, duplicate %X directives, duplicate {name} placeholders —
raises ValueError at mapper construction instead of an opaque re.error from
deep inside a scheduler tick or UI route.
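
A rough sketch of how a decode regex might be derived from `output_format`, with the validation failing early as a `ValueError`. The function name `build_decode_regex`, the supported directive subset (`%Y`, `%m`, `%d`), and the digits-only placeholder pattern are assumptions for illustration, not the code in this commit.

```python
import re

# Assumed directive subset for this sketch; the real mappers may support more.
_DIRECTIVE_PATTERNS = {"%Y": r"\d{4}", "%m": r"\d{2}", "%d": r"\d{2}"}
# A token is either a strftime directive or a {name} placeholder.
_TOKEN = re.compile(r"%[A-Za-z]|\{[^{}]*\}")


def build_decode_regex(output_format: str) -> re.Pattern:
    seen_groups = set()
    parts = []
    pos = 0
    for match in _TOKEN.finditer(output_format):
        # Literal text between tokens must match exactly.
        parts.append(re.escape(output_format[pos:match.start()]))
        token = match.group()
        if token.startswith("%"):
            if token not in _DIRECTIVE_PATTERNS:
                raise ValueError(f"unsupported directive {token!r}")
            group, pattern = token[1:], _DIRECTIVE_PATTERNS[token]
        else:
            group, pattern = token[1:-1], r"\d+"
            if not group.isidentifier():
                # Covers empty {} and names such as {1x}.
                raise ValueError(f"invalid placeholder {token!r}")
        if group in seen_groups:
            raise ValueError(f"duplicate {token!r} in {output_format!r}")
        seen_groups.add(group)
        parts.append(f"(?P<{group}>{pattern})")
        pos = match.end()
    parts.append(re.escape(output_format[pos:]))
    return re.compile("".join(parts))
```

Under these assumptions, `build_decode_regex("Q{quarter}/%Y").fullmatch("Q2/2026")` captures `quarter` as `"2"` and `Y` as `"2026"`, regardless of the order in which the two appear in the format.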
…ag_runs list

Drop the SQL "count distinct assets with any log" subquery and always
compute total_received via the Python rollup-aware helper. The list
endpoint previously returned different numbers for the same APDR
depending on whether the caller filtered by dag_id (rollup-aware,
counts upstream window keys) or queried globally (SQL approximation,
counts assets with any log) — same field, different semantics, very
confusing for any UI consumer.

The N+1 cost of per-Dag timetable loads was already paid in the
global branch for total_required, so adding a single batched log
fetch keeps the existing query budget while making the contract
identical across both views. _compute_received_count now skips
asset_ids that are no longer required (active=False) so the relaxed
log query doesn't over-count.
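
The "skip asset_ids that are no longer required" step can be illustrated in isolation. The function below is a hypothetical stand-in, not `_compute_received_count` itself; its signature and the `(asset_id, partition_key)` log shape are assumptions made for the sketch.

```python
def count_received(required_keys_by_asset, logs):
    """Count distinct (asset_id, key) pairs that are still required.

    required_keys_by_asset: dict mapping each still-required asset id to the
        set of upstream keys expected from it (assumed shape).
    logs: iterable of (asset_id, partition_key) tuples from a batched fetch.
    """
    received = set()
    for asset_id, key in logs:
        expected = required_keys_by_asset.get(asset_id)
        if expected is None:
            # Asset is no longer required (e.g. inactive): ignore its logs.
            continue
        if key in expected:
            received.add((asset_id, key))
    return len(received)
```

Because the same helper would serve both the filtered and the global view, the count cannot differ between them for the same inputs.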
StartOfWeekMapper now always uses ISO weeks (Monday) and
StartOfMonthMapper always emits the 1st of the month. Custom
fiscal boundaries can still be expressed by pairing a user-defined
source mapper with the existing windows.
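
For reference, the two fixed boundaries can be computed with the standard library alone. The helper names are illustrative and are not taken from the mappers.

```python
from datetime import date, timedelta


def start_of_iso_week(day: date) -> date:
    # date.weekday() is 0 for Monday, so this lands on the ISO week's Monday.
    return day - timedelta(days=day.weekday())


def start_of_month(day: date) -> date:
    # Always the 1st of the same month.
    return day.replace(day=1)
```

For example, Tuesday 2026-04-28 maps to Monday 2026-04-27 and to 2026-04-01 respectively.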
The next_run_assets and partitioned_dag_runs endpoints used to load
and deserialize the full timetable on every request just to read
mapper attributes (is_rollup) and required-key counts. Cache mapper
metadata per asset on DagModel during Dag sync via a new
``partition_mapper_info`` JSON column, so the UI resolves mapper
attributes from the cache and only loads the timetable when
``to_upstream`` evaluation for rollup mappers is actually needed.
Composes upstream_mapper + window + (optional) fine_mapper, symmetric
to RollupMapper. New [scheduler] partition_fanout_max_keys caps the
downstream keys per upstream event.
Lee-W force-pushed the partition-fanout branch from dd3408e to 3b5dee1, May 8, 2026 12:52

Development

Successfully merging this pull request may close these issues.

Implement fan-out (one-to-many partition mapper)
